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A Familiar Story When Building PB Sized Storage Systems 


= Center manager is negotiating with vendor for updated system
= Focused attention given to
  = CPU architecture
  = Memory architecture
  = Bus architecture
  = Network topology and technology
  = Linpack performance
  = Qualifying for Top 500
  = Power and cooling
= Oh, almost forgot storage...
  = “Give me what I had, only more of it.”
= System performance is compromised by inadequate storage I/O bandwidth
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Storage Capacity, Performance Increases over Time 





= 1965
  = Capacity < 205 MB
  = Streaming data rate < 2 MB/s (26 platters laterally mounted)
  = Rotational speed = 1200 RPM
= [...]
  = Streaming data rate < 3 MB/s (2 [...] spindles)
  = Rotational speed = 3600 RPM
  = Average seek time = 12 ms
= 1996
  = Capacity < 9 GB
  = Streaming data rate < 21 MB/s
  = Rotational speed = 10 Krpm
  = Average seek time = 7./ ms
= 2008
  = SATA
    = Capacity < 1000 GB
    = Streaming data rate < 105 MB/s
    = Rotational speed = 7200 RPM
    = Average seek time = 9 ms
  = [...]
    = Capacity < 450 GB
    = Streaming data rate < 425 MB/s
    = Rotational speed = 15 Krpm
    = Average seek time = 3.6 ms
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the System Upgrade 


= System administrators are generally responsible for 
“operationalizing” system upgrades. 


= The following pages provide some common and some not so 
common cases of processing centers scaling to the PB range. 





Common Scenario #1 


IBM Deep Computing
= Juan currently manages a small cluster 
= 64 Linux nodes with SAN attached storage 


= Storage = 25 TB (64 x 146 GB FC disks + 64 x 300 GB FC disks) 
= Juan's new cluster will be much larger 


= 256 Linux nodes with future upgrades up to 512 Linux nodes 
= Raw capacity starting at 200 TB increasing up to 0.5 PB 
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Common Scenario #2 














= Soo Jin’s company has a variety of computer systems that are 
independently managed 
= Modest cluster of 128 Linux nodes with a clustered file system 


= Several smaller clusters consisting of 16 to 64 Linux or Windows nodes 
accessing storage via NFS or CIFS 


= Several SMP systems with SAN attached storage 
= 2 types of storage 
= FC and SAS disk: 100 TB 
= SATA: 150 TB 
= Soo Jin has been asked to consolidate and expand the company’s 
computer resources into a new system configured as a cluster 
= 512 Linux nodes with future upgrades up to 1024 Linux nodes 
= No more SMP systems 
= Raw disk capacity starting at 0.5 PB increasing up to 1 PB
= Must provide tape archive 
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Common Scenario #3 











= Lynn manages a small cluster with a large storage capacity
  = Small cluster of 32 nodes (mixture of Linux and Windows)
  = All storage is SAN attached
  = 3 classes of storage
    = FC disk ~= 75 TB (256 disks behind 4 controllers)
    = SATA disk ~= 360 TB (720 disks behind 3 controllers)
    = Tape archive approaching 1 PB
= Lynn's new system will double every 18 months for the next 5 years with similar usage patterns
= With the next upgrade, Lynn's storage must be more easily accessible to other departments and vice versa; currently files are exchanged using ftp, scp or exchanging tape cartridges. One department has a cluster consisting of 256 Linux nodes.
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Not as Common Scenario #4 
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= Abdul currently manages a moderate sized university cluster 
= 256 Linux nodes 


= Storage 


= 20 TB of FC disk under a clustered file system for fast access 
= 50 TB of SATA disks accessible via a NFS system 
= Abdul's new cluster will be much larger
= 2000 Linux nodes 


= 2 large SMP systems (e.g., 64 cores) using a proprietary OS 
= Storage capacity = 5 PB 
= Mixed I/O profile: 

= Small file, transaction access 

= Large file, streaming access 










Lots of Questions 




















= What is my I/O profile?
= How can I control cost?
= How do I configure my system?
= Should I use a LAN or SAN approach?
= What kind of networks do I need?
= Can I extend my current solution, or do I need to start with a whole new design?
= Given the rate of growth in storage systems, how should I plan for future upgrades?
= What is the trade-off between capacity and performance?
= Can I use NFS or CIFS, or do I need a specialized file system?
= What are the performance issues imposed by a PB sized file system?
  = streaming rates, IOP rates, metadata management
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Understanding Your User Profile 











= Cache Locality 
= Working set: a subset of the data that is actively being used 
= Spatial locality: successive accesses are clustered in space 
= Temporal locality: successive accesses are clustered in time 
= Optimum Size of the Working Set 
= Good spatial locality generally requires a smaller working set 
= Only need to cache the next 2 blocks for each LUN (e.g., 256 MB) 
= Good temporal locality often requires a larger working set 


= The longer a block stays in cache, the more likely it can be accessed 
multiple times without swapping 


= Generic file systems generally use virtual memory system for cache 
= Favor temporal locality 
= Can be tuned to accommodate spatial locality (n.b., vmtune) 
= Virtual memory caches can be as large as all unused memory 
= Examples: ext3, JFS, Reiser, XFS 
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Understanding Your User Profile 


= Common Storage Access Patterns 
  = Streaming
    = Large files (e.g., GB or more) with spatial locality
    = Performance is measured by bandwidth (e.g., MB/s, GB/s)
    = Common in HPC, scientific/technical applications, digital media
  = IOP Processing
    = Small transactions with poor temporal and poorer spatial locality
      = small files or irregular small records in large files
    = Performance is measured in operation counts (e.g., IOP/s)
    = Common in bio-informatics, rendering, EDA, home directories
  = Transaction Processing
    = Small transactions with varying degrees of temporal locality
    = Databases are good at finding locality
    = Performance is measured in operation counts (e.g., IOP/s)
    = Common in commercial applications
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Understanding Your User Profile 


= Most environments have mixed access patterns 
= If possible, segregate data with different access patterns 


= Best Practice: do not place home directories on storage systems 
used for scratch space 


= Best practice: before purchasing a storage system 
= Develop “use cases” and/or representative benchmarks
= Develop file size histogram 
= Establish mean and standard deviation data rates 


= Rule of thumb: “Design a storage system to handle data rates 3 or 4 
standard deviations above the mean.” 


= John Watts, Solution Architect, IBM 
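The rule of thumb above can be turned into a small calculation. The following is a minimal sketch, assuming the measured data rates are available as a list of samples in MB/s; the sample values and the function name `design_data_rate` are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean, stdev


def design_data_rate(samples_mb_s, k=3.0):
    """Return mean + k * (sample standard deviation) of the observed data rates."""
    if len(samples_mb_s) < 2:
        raise ValueError("need at least two samples to estimate a standard deviation")
    return mean(samples_mb_s) + k * stdev(samples_mb_s)


# Hypothetical measurements of aggregate data rate (MB/s), for illustration only.
observed = [400.0, 450.0, 500.0, 550.0, 600.0]

target_3 = design_data_rate(observed, k=3)  # 3 standard deviations above the mean
target_4 = design_data_rate(observed, k=4)  # 4 standard deviations above the mean
print(f"design target: {target_3:.0f} to {target_4:.0f} MB/s")
```

In practice the samples would come from monitoring the existing system over a representative period; a short or unrepresentative sample makes the standard deviation unreliable.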
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Understanding Your User Profile 


= Use Cases 
  = Benchmarks based on real applications
  = Provide the best assessment of actual usage
  = Carefully select representative workload
  = Can be difficult to use
    = Requires more time to evaluate than with synthetic benchmarks.
    = Can you give the data/code to vendor to use?
    = Is vendor willing to provide “loaner” system to customer?
= Synthetic benchmarks
  = Easier to use and results are often published in white papers
    = Vendor published performance is usually based on synthetic benchmarks
    = But do they use a real file system configured for production environment?
  = Select benchmark codes that correlate to actual usage patterns
    = If a storage system meets a stated performance objective using a given benchmark, then it will be adequate for my application environment
  = Common examples
    = Bonnie++, IOR, iozone, xdd, SpecFS
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Cost vs. Capacity vs. Performance vs. Reliability 





= Do you want to optimize 
= Streaming performance 
= IOP performance 
= Capacity 
= Cost 


= Reliability 
= How much can you spend to get what you need? 


= Gripe: Accountants should not dictate technical policy! 
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Cost vs. Capacity vs. Performance vs. Reliability 


= Enterprise Class Disk
  = Optimizes reliability as well as streaming and IOP performance.
  = Fibre Channel (FC) Disk
  = Serial Attached SCSI (SAS)
  = Common Sizes: 146, 300, 450 GB
  = MTBF = 1.4 MHour
  = Rotational speed = 15 Krpm
  = Single drive IOP rate, 4K transactions (no caching): 380 IOP/s
  = Single drive streaming rate* via RAID controller
    = Controller cache disabled: write = 50.8 MB/s, read = 95.4 MB/s
    = Controller cache enabled: write = 154.6 MB/s, read = 123.6 MB/s
  = Best practice: Configure using RAID 3 or RAID 5
    = 4+P or 8+P is common

*Based on DS4800 benchmark accessing the “raw disk” via dd.
dd buffer size = 1024K, cache block size = 16K, segment size = 256K
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Cost vs. Capacity vs. Performance vs. Reliability 


= Cost Optimized Disk
  = Optimizes capacity. Streaming performance and reliability are often good enough.
  = Serial ATA (SATA) Disk
  = Common Sizes: 750, 1000 GB
    = Larger sizes are not generally supported in many current generation controllers
  = MTBF = 0.7 MHour
    = The MTBF rating is being replaced by annualized failure rate (AFR) which is 0.34% on representative SATA disks
  = Rotational speed = 7200 RPM
  = Single drive IOP rate, 4K transactions (no caching): 70 IOP/s
    = Command tag queuing (NCQ) can increase this rate to 120 IOP/s
  = Single drive streaming rate* via RAID controller
    = Controller cache disabled: write = 18.5 MB/s, read = 59.2 MB/s
    = Controller cache enabled: write = 30.3 MB/s, read = 74.9 MB/s
  = Best practice: Configure using RAID 6, especially in larger storage systems
    = 8+P+Q is common

*Based on DS4700 benchmark accessing the “raw disk” via dd.
dd buffer size = 1024K, cache block size = 16K, segment size = 64K
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Cost vs. Capacity vs. Performance vs. Reliability 


= For PB sized file systems, SATA may be good enough!
  = Depends in part on how the storage controller manages RAID
    = 240 SATA disks yield similar streaming performance to 128 FC disks*
  = SATA IOP rates are much less than the FC IOP rates given poor locality
= SATA using RAID 6 “levels the playing field” compared with FC using RAID 5
  = RAID 6 significantly lowers the risk of data loss due to “dual disk failures”
  = RAID capacity overhead is similar for 8+2P RAID 6 and 4+P RAID 5
  = RAID rebuild times with SATA/RAID 6 are longer than FC/RAID 5; this may be exacerbated by more frequent RAID rebuilds for SATA
    = some storage controllers can in part compensate for this
= Usable Capacity for SATA is much greater than FC disks
  = SATA with 8+2P RAID 6: 240 x 1 TB < 192 TB
  = FC with 4+P RAID 5: 128 x 450 GB < 46 TB

*Based on DS5300 benchmarks using the EXP5000 trays with 15Krpm FC and EXP5060 trays with 7200 RPM SATA

The trade-off point is different for different storage controllers.
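The usable capacity figures above follow from the RAID group geometry. The sketch below is one way to reproduce them, assuming only whole RAID groups hold data and any leftover disks are treated as spares; the function name and that spare-disk assumption are illustrative, not taken from the slides.

```python
def usable_capacity_tb(n_disks, disk_tb, data_disks, parity_disks):
    """Usable capacity when disks are arranged in whole data+parity RAID groups."""
    group_size = data_disks + parity_disks
    full_groups = n_disks // group_size  # leftover disks are assumed to be spares
    return full_groups * data_disks * disk_tb


def parity_overhead(data_disks, parity_disks):
    """Fraction of each RAID group consumed by parity."""
    return parity_disks / (data_disks + parity_disks)


sata_tb = usable_capacity_tb(240, 1.0, data_disks=8, parity_disks=2)   # 8+2P RAID 6
fc_tb = usable_capacity_tb(128, 0.45, data_disks=4, parity_disks=1)    # 4+P RAID 5

print(f"SATA 8+2P RAID 6: {sata_tb:.0f} TB usable, overhead {parity_overhead(8, 2):.0%}")
print(f"FC   4+P  RAID 5: {fc_tb:.0f} TB usable, overhead {parity_overhead(4, 1):.0%}")
```

Both layouts give up 20% of raw capacity to parity, which is why the slide calls the overhead similar.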
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Cost VS. Capacity vs. Performance vs. Reliability 


= Reduce Cost Using Storage Hierarchy
  = Multiple storage tiers
    = Tier 1: Enterprise class (FC, SAS)
    = Tier 2: Cost optimized storage
      = SATA
    = Tier 3: Tape stored in libraries
    = Tier 4: Tape stored off-site
  = Backup vs. Archive
    = Archive: single copy of data
    = Backup: multiple copies of data
  = Best practice: integrate disk and tape layer

[Figure: storage tier pyramid. Tier-1: fast disk (e.g., FC disk). Tier-2: high capacity disk (e.g., SATA), infrequently used files. Tier-3: local tape libraries. Tier-4: remote tape libraries. The top of the pyramid is high BW/low latency and more expensive; the bottom is infrequent use, higher latency and less expensive.]
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Cost vs. Capacity vs. Performance vs. Reliability 


= Realistically Assess Uptime and Availability Requirements 
= Is a quality of service (QOS) guarantee necessary 
= Example: guaranteeing full performance in spite of component failures 
= Percentage of uptime requirements 
= 99.999% uptime ~= 5 min of down time per year 
= 99.99% uptime ~= 1 hour of down time per year 
= 99.9% uptime ~= 9 hours of down time per year 
= Guaranteed access to data 
= If this is a requirement...
= Is access to all data in your data store necessary?
= Is immediate access to the data necessary?
= Design disaster recovery procedures 
= Setting artificially high standards requires redundant systems and 
unnecessary cost. 
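The downtime figures quoted for each uptime percentage can be checked with a short calculation. This is a minimal sketch assuming a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_minutes_per_year(uptime_percent):
    """Minutes of allowed downtime per year for a given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR * 60


for pct in (99.999, 99.99, 99.9):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime -> {minutes:.1f} min/year ({minutes / 60:.2f} hours)")
```

The results (about 5 minutes, 53 minutes and 8.8 hours) agree with the rounded values on the slide.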
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Cost vs. Capacity vs. Performance vs. Reliability 


= Considerations for Re-provisioning Legacy Storage
  = Can I preserve my investment?
  = Can I save money doing it? Does the cost of re-provisioning storage exceed its value?
  = Does it lock me into older technology that is no longer optimum for my application environment?
  = Is it feasible to segregate legacy storage and new storage?
    = If this is true, this is generally the easiest way to do it.
    = If not, is there an appropriate software product for my environment that can integrate them?
= Re-provisioning storage hardware is a common requirement.
  = Many file systems can accommodate this requirement to varying degrees.
  = There are also specialized software products that can do this.
  = When other strategies are not feasible, NFS is often “good enough”.
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Building Block ‘Strategy 
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= Building Block Concept 


= Define a smallest common storage unit consisting of servers, 
controllers and disks 
= Replicate it multiple times until capacity and performance requirements 
are satisfied 
= Leads to a “build out as you grow” strategy
= Issues 
= Building blocks work best with LAN based file systems 
= Today's storage technology is well suited for large building blocks which 
is appropriate for PB sized storage systems! 
= Controller cost/architecture make small building blocks less feasible 
= Small building blocks are not as effective in PB sized file systems 


= Small building blocks increase component counts which increases the 
risk of failure, yet they can have excellent price/performance curves 


= Building block design is often dictated by the choice of file system 
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= Balance 


= Ideally, an I/O subsystem should be balanced 


= Do not make one part of storage system fast and another slow 


Overtaxing some components of the I/O subsystem may disproportionately degrade 
performance 


= Warning: customer requirements may make this goal unachievable 
= “Performance is often inversely proportional to capacity.” 
= Todd Virnoche, Business Partner Enablement, IBM 


= Number of disks needed to meet capacity exceeds performance 


= Number of disks needed to meet capacity yields greater performance than 
needed 


Common example: data warehouses 


= Number of disks needed to meet performance exceeds capacity 


Common example: national labs, university computing centers 
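The two cases above (capacity-bound versus performance-bound) can be told apart by computing the disk count each requirement implies and seeing which is larger. The sketch below assumes a fixed sustained rate per disk and ignores RAID overhead and controller limits; all workload numbers and per-disk rates are hypothetical.

```python
import math


def disks_needed(capacity_tb, rate_mb_s, disk_tb, disk_mb_s):
    """Return (disks for capacity, disks for performance, limiting requirement)."""
    for_capacity = math.ceil(capacity_tb / disk_tb)
    for_performance = math.ceil(rate_mb_s / disk_mb_s)
    limiting = "capacity" if for_capacity >= for_performance else "performance"
    return for_capacity, for_performance, limiting


# Hypothetical data warehouse: large capacity, modest bandwidth, 1 TB disks.
warehouse = disks_needed(capacity_tb=500, rate_mb_s=2000, disk_tb=1.0, disk_mb_s=50)

# Hypothetical computing center: modest capacity, high bandwidth, 450 GB disks.
hpc_center = disks_needed(capacity_tb=50, rate_mb_s=10000, disk_tb=0.45, disk_mb_s=60)

print("data warehouse:", warehouse)
print("computing center:", hpc_center)
```

When the capacity count dominates, the system delivers more performance than required; when the performance count dominates, it ends up with more capacity than required.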
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Building Block Example 
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Example #1A — Large Building Block, Performance Optimized 


[Diagram: four storage servers (Server-01 to Server-04; each nehalem, 8 cores, 6 DIMMs, with GbE ports) connect to an Ethernet switch* (TbE: storage, GbE: sysadm) and to the RAID controllers via 2xFC8.]

* Either an IB or Ethernet LAN can be used for storage access.
+ Uniform distribution can be difficult to sustain over bounded channels


Performance Analysis
= DCS9900 Performance
  = Streaming data rate < 5.6 GB/s
  = Noncached IOP rate < 40,000 IOP/s
= LAN: 4xDDR IB HCA (RDMA)
  = Potential peak data rate per HCA < 1500 MB/s
  = Required peak data rate per HCA < 1200 MB/s
= SAN: 2xFC8 (dual port 8 Gbit/s Fibre Channel)
  = Potential peak data rate per 2xFC8 < 1500 MB/s
  = Required peak data rate per 2xFC8 < 1200 MB/s

[Diagram: two S2A9900 (2U) RAID controllers, C1 and C2, each with host ports; 8 x FC8 host connections.]
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O 
= 5 disk trays, 32 disks per tray
= 160 x SAS disks: minimum required to saturate couplet performance

Capacity Analysis
= SAS @ 15Krpm
  = 160 disks @ 450 GB/disk
  = 16 x 8+2P RAID 6 tiers
  = Capacity < 72 TB
= PC Ratio = 78 MB/s / TB
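The performance to capacity (PC) ratio quoted above is the streaming rate divided by the capacity. A minimal sketch of that arithmetic, assuming 1 GB/s = 1000 MB/s and using the raw capacity as the slide does (the function names are illustrative):

```python
def raw_capacity_tb(n_disks, disk_gb):
    """Raw capacity in TB for n_disks of disk_gb each (1 TB = 1000 GB)."""
    return n_disks * disk_gb / 1000


def pc_ratio_mb_s_per_tb(streaming_gb_s, capacity_tb):
    """Performance to capacity ratio in MB/s per TB (1 GB/s = 1000 MB/s)."""
    return streaming_gb_s * 1000 / capacity_tb


capacity = raw_capacity_tb(160, 450)          # 160 x 450 GB SAS disks
ratio = pc_ratio_mb_s_per_tb(5.6, capacity)   # couplet streaming rate < 5.6 GB/s
print(f"capacity = {capacity:.0f} TB, PC ratio = {ratio:.0f} MB/s per TB")
```

Adding disks beyond the point where the couplet is saturated raises capacity without raising the streaming rate, so the PC ratio falls.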
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Building Block Example 
Example #1A (IB) — 2 Building Blocks, Performance Optimized 


LAN connections to office systems via CIFS or NFS
GbE LAN not shown. Assume each node and controller has a GbE connection.

[Diagram: Frame #1 and Frame #2 each hold storage servers and an S2A9900 couplet with 5 trays (Tray #1 to Tray #5) of 32 x 15Krpm SAS disks per tray; the storage servers connect through an IB switch to 64 storage clients. Legend: GbE, IB, FC8.]

Aggregate Statistics
= 64 client nodes
= Streaming < 11 GB/s
  = Avg ~= 180 MB/s per node
  = Requires IB to be BW effective
= Capacity < 144 TB
  = 320 disks * 450 GB/disk

Scaling to PB Range
= Requires 14 bldg blocks
= Streaming < 78 GB/s
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Building Block Example 
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Example #1B — Large Building Block, Capacity Optimized 


[Diagram: four storage servers (Server-01 to Server-04; each nehalem, 8 cores, 6 DIMMs, with GbE ports) connect to an Ethernet switch (sysadm) and to an IB switch* (storage).]

Performance Analysis
= DCS9900 Performance
  = Streaming data rate < 5.6 GB/s
  = Noncached IOP rate < 40,000 IOP/s
= LAN: 4xDDR IB HCA (RDMA)
  = Potential peak data rate per HCA < 1500 MB/s
  = Required peak data rate per HCA < 1200 MB/s
= SAN: 2xFC8 (dual port 8 Gbit/s Fibre Channel)
  = Potential peak data rate per 2xFC8 < 1500 MB/s
  = Required peak data rate per 2xFC8 < 1200 MB/s

*Either an IB or Ethernet LAN can be used for storage access.
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1L 
[Diagram: two S2A9900 (2U) RAID controllers, C1 and C2, each with host ports; 8 x FC8 host connections.]

= 300 x SATA disks: minimum to saturate couplet performance
= 1200 x SATA disks: maximize capacity

Capacity Analysis
= Balanced capacity/performance
  = 300 x SATA disks
    = 5 disk trays
    = 30 x 8+2P RAID 6 tiers
    = Capacity < 300 TB
  = PC Ratio = 18 MB/s / TB
= Purely capacity optimized
  = 1200 x SATA disks
    = 20 disk trays
    = 120 x 8+2P RAID 6 tiers
    = Capacity < 1.2 PB
  = PC Ratio = 4.6 MB/s / TB
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Building Block Example 
Example #1B (TbE) — 2 Building Blocks, Balanced Performance/Capacity 


LAN connections to office systems via CIFS or NFS

[Diagram: two building blocks, each an S2A9900 couplet with 5 trays (Tray #1 to Tray #5) of 60 x SATA disks per tray; storage servers Server-01 to Server-08 connect through Ethernet switches to 256 storage clients housed in Frame #1 to Frame #8.]

Aggregate Statistics
= 256 client nodes
= Streaming < 11 GB/s
  = Avg ~= 45 MB/s per node
  = IB is overkill for this storage system
    = GbE is adequate for 256 nodes unless there is a large variance in the workload requiring short bursts of high bandwidth.
= Capacity < 600 TB
  = 600 disks * 1 TB/disk

Scaling to PB Range
= Requires 4 bldg blocks
= Streaming < 22 GB/s
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Example #1B (TbE) — 1 Building Block, Capacity Optimized 


LAN connections to office systems via CIFS or NFS

[Diagram: one building block, an S2A9900 couplet with 20 trays (Tray #1 to Tray #20); storage servers connect through Ethernet switches to 256 storage clients housed in Frame #1 to Frame #8.]

Aggregate Statistics
= 256 client nodes
= Streaming < 5.6 GB/s
  = Avg ~= 22 MB/s per node
  = IB is overkill for this case
    = GbE is adequate for 256 nodes unless there is a large variance in the workload requiring short bursts of high bandwidth.
= Capacity < 1.2 PB

Scaling to PB Range
= Not necessary... this is a PB!
= Caution: If the client cluster is large (e.g., 1024 nodes), the data rate per node will be very small (e.g., 5 MB/s per node). If the variance is large, this may then be less of an issue.
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Building Block Example 


Summary Example — Capacity vs. Performance, IB vs. Ethernet 
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LAN connections to office 
systems via CIFS or NFS 


[Diagram: 8 building blocks, each with 4 servers and an S2A9900 couplet with 5 trays, connect via GbE through an Ethernet switch to the client nodes housed in Frame #1 to Frame #32.]

Aggregate Statistics
= 8 building blocks
= 1024 client nodes
= Using building block #1A
  = Streaming < 45 GB/s
    = Avg ~= 45 MB/s per node
  = Capacity < 576 TB
    = PC Ratio = 80 MB/s per TB
= Using building block #1B (balanced)
  = Streaming < 45 GB/s
    = Avg ~= 45 MB/s per node
  = Capacity < 2.4 PB
    = PC Ratio = 19 MB/s per TB

IB vs. Ethernet
= Ethernet is adequate for storage access
  = Avg ~= 45 MB/s < GbE ~= 80 MB/s
  = assumes peak bandwidth per node < 80 MB/s
= Assume one or both of the following
  = Peak client storage rate > 80 MB/s
  = Avg message passing rate > 35 MB/s
= Two possible solutions
  = Create dedicated GbE LAN for message passing
  = Use IB LAN instead
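The scaling and network reasoning in the building block examples can be sketched as a few helper functions. This is an illustration only: it assumes 1 PB = 1000 TB, a per-block streaming rate of 5.6 GB/s, and a sustained GbE rate of about 80 MB/s as quoted on the slide; the function names and the peak rates in the usage lines are hypothetical.

```python
import math


def blocks_for_capacity(target_tb, block_tb):
    """Number of building blocks needed to reach a target capacity."""
    return math.ceil(target_tb / block_tb)


def avg_rate_per_node_mb_s(n_blocks, block_gb_s, n_clients):
    """Average streaming rate per client node in MB/s (1 GB/s = 1000 MB/s)."""
    return n_blocks * block_gb_s * 1000 / n_clients


def gbe_is_adequate(avg_mb_s, peak_mb_s, gbe_mb_s=80.0):
    """GbE suffices if both average and peak per-node rates stay below its sustained rate."""
    return avg_mb_s < gbe_mb_s and peak_mb_s < gbe_mb_s


blocks_1a = blocks_for_capacity(1000, 72)    # building block #1A, 72 TB each
blocks_1b = blocks_for_capacity(1000, 300)   # building block #1B (balanced), 300 TB each
avg_1024 = avg_rate_per_node_mb_s(8, 5.6, 1024)

print(blocks_1a, blocks_1b, f"{avg_1024:.1f} MB/s per node")
print(gbe_is_adequate(avg_1024, peak_mb_s=60.0))    # hypothetical peak below GbE rate
print(gbe_is_adequate(avg_1024, peak_mb_s=120.0))   # hypothetical bursty workload
```

If the second check fails, the two options listed above apply: a dedicated GbE LAN for message passing, or an IB LAN.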












Building Block Example — Common Mistake 
Example #1B (TbE) — 2 Building Blocks 


LAN connections to office systems (Windows, Mac, Linux) via CIFS or NFS

[Diagram: two S2A9900 couplets, each with Tray #1 to Tray #5; storage servers connect through Ethernet switches to the client nodes.]

Aggregate Statistics
= 256 client nodes
= Streaming < 11 GB/s
= Avg ~= 45 MB/s per node

Common mistake
= SATA
  = 2 x Couplets
  = 600 x 1 TB SATA < 600 TB
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Building Block Example — Common Mistake 
Example #1B (TbE) — 2 Building Blocks 


LAN connections to office systems (Windows, Mac, Linux) via CIFS or NFS

[Diagram: two S2A9900 couplets, each with Tray #1 to Tray #10; storage servers Server-01 to Server-08 connect through Ethernet switches to 256 storage clients. Legend: GbE, TbE, FC8.]

Aggregate Statistics
= 256 client nodes
= Streaming < 11 GB/s
= Avg ~= 45 MB/s per node

Common mistake
= SATA vs. SAS
  = 2 x Couplets
  = 600 x 1 TB SATA < 600 TB
  = 1200 x 450 GB < 540 TB
  = Streaming performance is identical




















Building Block Example


Example #2A — Small Building Block, Performance Optimized 


[Diagram: two storage servers (Server-01 and Server-02; each nehalem, 8 cores, 6 DIMMs, with GbE, TbE and 2 x FC4) connect to an Ethernet switch and to two storage controllers. Each controller has internal disks (10 x 15 Krpm SAS disks, 450 GB/disk) plus one disk enclosure with ESM-A and ESM-B (10 x 15 Krpm SAS disks, 450 GB/disk), configured as 4+P RAID 5.]

= Storage Servers
  = 2xFC4 < 780 MB/s
  = TbE < 725 MB/s
= Storage Controller
  = “Twin tailed” disks
  = 20 disks per controller
    = 15Krpm FC disks
  = Write rate < 650 MB/s
  = Read rate < 800 MB/s
  = Capacity < 9 TB
= Aggregate Statistics
  = Data rate < 1450 MB/s
  = Capacity < 18 TB
  = PC Ratio = 80 MB/s / TB

Multiple servers, controllers and ports guarantee resilience.


8+2P 
RAID 6 
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Example #2B — Small Building Block, Capacity Optimized 


[Diagram: two storage servers (Server-01 and Server-02; each nehalem, 8 cores, 6 DIMMs, with GbE and 2 x FC4) connect to an Ethernet switch and to two storage controllers (each with Controller-A and Controller-B). Each controller has internal disks (10 x SATA disks, 1 TB/disk) plus 3 disk enclosures (30 x SATA disks, 1 TB/disk).]

= Storage Servers
  = 2xFC4 < 780 MB/s
  = TbE < 725 MB/s
= Storage Controller
  = “Twin tailed” disks
  = 40 disks per controller
    = SATA disks
  = Write rate < 650 MB/s
  = Read rate < 800 MB/s
  = Capacity < 40 TB
= Aggregate Statistics
  = Data rate < 1450 MB/s
  = Capacity < 80 TB
  = PC Ratio = 18 MB/s / TB

Multiple servers, controllers and ports guarantee resilience.
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Example #2 — Miscellaneous Comments 
= Example #2A


  = There is room for 24 disks per disk controller, but 20 x 15Krpm disks in a 4+P RAID 5 configuration maximize the streaming performance of the controller.
    = In practice, 2 more disks are frequently included as “hot spares”.
  = To maximize IOP rate, the number of disks can be increased up to 48 per controller.
= Example #2B
  = There is room for 48 disks per disk controller, but 40 x SATA disks in a 4+2P RAID 6 configuration maximize the performance of the controller.
= Caution
  = JBOD configuration increases the performance to capacity ratio, but the risk exposure of data loss in large configurations is unacceptably high.
  = While the streaming performance of these 2 solutions is similar, the IOP rate for the SATA solution is much less.
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Example #2A — 2 Building Blocks, Performance Optimized 
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LAN connections to office 
systems via CIFS or NFS 


Aggregate Statistics
= Streaming < 3 GB/s
= Capacity < 36 TB

[Diagram: Ethernet Switch* (GbE & TbE). Frame #1 (32/42 U) holds Server-1 to Server-4 and Controller #1 to Controller #4, each controller with one enclosure (e.g., Enclosure #1.1). Legend: GbE, TbE, FC4. Drive side cabling not shown.]

Scaling to PB Range
= Requires 28 bldg blocks+
= Streaming < 16 GB/s
= Need > 500 GbE clients in order to fully utilize BW
= Small building block issues to be managed:
  = Complexity of managing 28 controllers
  = Controller failure (more controllers implies decreased MTBF)

+ This is a good example of “give me the same thing, only bigger”. In practice, if this solution is scaled out to a PB, it will be difficult to administer and maintain.

* For greater redundancy, the Ethernet fabric can be deployed over 2 switches
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Frame #1 (34/42 U) 





[Figure labels, Frame #1: Servers 1 and 2, Controller #1, Enclosures #1.1 and #1.2]
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Building Block Example 
Example #2B — 2 Building Blocks, Capacity Optimized 


[Figure: Ethernet Switch* (GbE & TbE) above two frames (34/42 U each). Frame #1 holds Servers 1 and 2 and Controllers #1 and #2 with Enclosures #1.1 to #1.4 and #2.1 to #2.4; Frame #2 holds Servers 3 and 4 and Controllers #3 and #4 with Enclosures #3.1 to #3.4 and #4.1 to #4.4. Links are GbE, TbE and FC4.]

Drive side cabling not shown.
* For greater redundancy, the Ethernet fabric can be deployed over 2 switches
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Aggregate Statistics 
= Streaming < 3 GB/s
= Capacity < 160 TB

Scaling to PB Range
= Requires 12 bldg blocks
= Streaming < 16 GB/s
= Need > 200 GbE clients in order to fully utilize BW
= Small building block issues to be managed:
  = RAID rebuild time
  = Controller failure (more controllers implies decreased MTBF)

Impact of 2 TB SATA Drives
= Lower PC ratio = 9 MB/s / TB
= Longer RAID rebuild times
= Requires only 6 building blocks, lowering the component count to
something manageable.
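
The performance-to-capacity (PC) ratio quoted above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 GB = 1000 MB) and assuming that moving to 2 TB drives doubles the capacity of the 2-building-block configuration while its streaming rate stays at 3 GB/s; the function names are illustrative only.

```python
def pc_ratio_mb_per_tb(streaming_gb_s: float, capacity_tb: float) -> float:
    """Performance-to-capacity ratio in MB/s per TB (assumes 1 GB = 1000 MB)."""
    return streaming_gb_s * 1000.0 / capacity_tb


def implied_client_rate_mb_s(aggregate_gb_s: float, clients: int) -> float:
    """Per-client rate implied by an aggregate rate spread over N clients."""
    return aggregate_gb_s * 1000.0 / clients


# Slide figures: 2 building blocks deliver < 3 GB/s and < 160 TB.
ratio_1tb_drives = pc_ratio_mb_per_tb(3.0, 160.0)

# Assumption: 2 TB drives double the capacity, streaming rate unchanged.
ratio_2tb_drives = pc_ratio_mb_per_tb(3.0, 320.0)

# Slide figures: < 16 GB/s needs > 200 GbE clients to be fully used.
per_client = implied_client_rate_mb_s(16.0, 200)

print(f"PC ratio, 1 TB drives: {ratio_1tb_drives:.2f} MB/s per TB")
print(f"PC ratio, 2 TB drives: {ratio_2tb_drives:.2f} MB/s per TB")
print(f"Implied rate per GbE client: {per_client:.0f} MB/s")
```

Under these assumptions the ratio roughly halves, which agrees with the 9 MB/s per TB figure on the slide.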
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Ethernet Switch (GbE) 


[Figure: current system. Recoverable labels: Ethernet Switch (GbE), FC Switch (4 Gb/s), Controller with Enclosures #1 to #8, NFS Server, Samba Server, FC4 links, bonded 2xGbE links.]

Current System
= 56 blades, each with GbE and FC4 port
= Desktop access
  = NFS Server with 2xFC4 and 2xGbE
  = Samba Server with 2xFC4 and 2xGbE
= Storage Controller Under a SAN File System
  = Capacity < 150 TB
  = Data rate
    = Aggregate rate: 1 to 2 GB/s
    = Average rate per node: 10 to 15 MB/s
    = Burst rate*: up to 200 MB/s

Requirements for New Cluster
= Phase 1
  = 160 nodes with IB network
  = Capacity = 500 TB
  = Data rate
    = Aggregate rate: 3 to 4 GB/s
    = Average rate per node: up to 20 MB/s
    = Burst rate*: up to 300 MB/s
= Phase 2: everything doubles in 18 months

* Short bursts of activity occurring on several blades at any given time.
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New System: “Give Me the Same Thing, Only Bigger” 
[Figure: new system. Recoverable labels: NFS Server, Samba Server, TbE links, storage built from 10 x 60-disk drawers.]

COMMENT:
This is a good SAN design. While large (n.b., 168 nodes), it is not excessive and can be managed by most file systems supporting a SAN architecture.

The issue is with future expansion. At this point, the largest SAN file systems in production consist of 256 nodes connected by fibre channel; they are not likely to get larger in the near future.

LAN based file systems cost less and scale much larger.

If your node counts expand proportionally to data capacity,

SATA Disk
= 10 x 60-disk Drawers
= 600 x disks
= 60 x 8+2P RAID 6
Nodes
= 168 nodes
= 1 x IB HCA per node
Storage
= Capacity
  = Raw = 600 TB
  = Usable = 480 TB
= Sustainable data rates
  = Aggregate < 5.6 GB/s
  = Avg per node < 30 MB/s
  = Peak burst < 500 MB/s
  (limited by blocking factor)
* IB Blocking factor ~= 3:1
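
The raw and usable capacities listed for this design follow from the disk count and the RAID 6 geometry. This is a small sketch, assuming 1 TB per SATA disk (implied by 600 disks giving 600 TB raw, not stated outright); the helper names are illustrative only.

```python
def raid_arrays(total_disks: int, data_disks: int, parity_disks: int) -> int:
    """Number of complete RAID arrays that fit in the given disk count."""
    return total_disks // (data_disks + parity_disks)


def usable_tb(raw_tb: float, data_disks: int, parity_disks: int) -> float:
    """Usable capacity after parity overhead (ignores spares and FS overhead)."""
    return raw_tb * data_disks / (data_disks + parity_disks)


TOTAL_DISKS = 600          # 10 drawers x 60 disks, from the slide
TB_PER_DISK = 1.0          # assumption: implied by raw = 600 TB

raw_capacity_tb = TOTAL_DISKS * TB_PER_DISK
array_count = raid_arrays(TOTAL_DISKS, data_disks=8, parity_disks=2)
usable_capacity_tb = usable_tb(raw_capacity_tb, data_disks=8, parity_disks=2)

print(f"{array_count} x 8+2P arrays, raw {raw_capacity_tb:.0f} TB, "
      f"usable {usable_capacity_tb:.0f} TB")
```

The result matches the 60 x 8+2P arrays, 600 TB raw and 480 TB usable shown on the slide.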
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Performance Considerations: “Black Box Factor” 


= Ease of Use (high black box factor)
  = Advantages
    = Are generally considered easy to use and administer
    = Performance is “good enough” for many environments
  = Principal limitation
    = Lack flexibility and tuning options to adapt to specialized applications
  = Example: NAS devices
= Flexibility
  = Advantages
    = Generally support a wide range of storage products
    = Provide a wide range of tuning parameters, making them adaptable to a
      wide range of applications
  = Limitations
    = More difficult to learn and use
  = Example: General purpose file systems
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Performance Considerations: Seek Arm Mechanics 


= Seek arm movement dominates disk performance
  = 15Krpm FC Disk: 3.6 ms
  = 7200 RPM SATA Disk: 9.0 ms
= Therefore, write applications to move as much data as possible per
seek operation.
= Small files (e.g., 4K) are generally accessed in a random order, which
forces 1 seek arm movement per file for a correspondingly small chunk
of data.
= Large records in a large file allow the disk to access a large volume of
data per seek arm movement, thereby improving efficiency.
= But rewriting legacy codes is tedious and programming managers may
not approve it.
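
The effect of transfer size per seek can be estimated with a simple model. This is a rough sketch, assuming each access costs one average seek plus the time to stream the data, and ignoring rotational latency, caching and controller effects; the 105 MB/s streaming figure is the SATA upper bound quoted earlier in this deck, and the function name is illustrative only.

```python
def effective_mb_s(bytes_per_seek: int, seek_ms: float, stream_mb_s: float) -> float:
    """Effective throughput when every access pays one seek, then streams."""
    megabytes = bytes_per_seek / 1e6
    transfer_s = megabytes / stream_mb_s
    return megabytes / (seek_ms / 1e3 + transfer_s)


SATA_SEEK_MS = 9.0       # average seek time from the slide
SATA_STREAM_MB_S = 105   # streaming upper bound quoted earlier in the deck

small_file_rate = effective_mb_s(4 * 1024, SATA_SEEK_MS, SATA_STREAM_MB_S)
large_record_rate = effective_mb_s(16 * 1000 * 1000, SATA_SEEK_MS, SATA_STREAM_MB_S)

print(f"4 KB per seek:  {small_file_rate:.2f} MB/s")
print(f"16 MB per seek: {large_record_rate:.1f} MB/s")
```

In this model a 4 KB access pattern leaves the disk almost entirely seek-bound, while 16 MB records approach the streaming rate.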
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Avoid Single Points of Failure in PB Sized Storage Systems 





[Figure: cluster configurations ranging from many single points of failure to fewer single points of failure. Recoverable labels: clients on a LAN Switch (GbE); NFS Server - 01 to 04 with 1 file system per NFS server; redundant storage servers* (Server - 01 to 04) behind one LAN Switch or two LAN Switches (TbE); DS5300 Couplet; Trays #1 to #4.]

* requires appropriate file system

Increased redundancy can be achieved using 2xGbE per client and distributing the cluster over
multiple sites. Carefully assess uptime requirements to avoid “gold plating” in this regard.
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Principle Tools to Manage Storage 


= Benchmarking Tools 
= Synthetic benchmarks vs. use cases 


= System Monitoring Tools 
= Open source examples: ganglia, iostat, nmon, vmstat 


= Storage Controllers 
= Provide disk management and monitoring 
= Example OEMs: DDN, EMC, IBM, LSI 


= File Systems 
= The following pages take a closer look at file systems commonly used in
clusters where PB sized file systems are common. Some of them are
not as well suited for a PB scale as others.
= Many file systems provide monitoring tools. 
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File System Taxonomy 


The following pages examine a taxonomy of file systems
commonly used with clusters. They may or may not be a
clustered file system and they support varying degrees of
parallelism. They do not represent mutually exclusive choices.

> Conventional I/O
> Asynchronous I/O
> Networked File Systems
> Network Attached Storage (NAS)
> Basic Clustered File Systems
> SAN File Systems
> Multi-component Clustered File Systems
> High Level Parallel I/O
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Conventional I/O 











> Used generally for "local file systems" 
e the basic, "no frills, out of the box" file system 


> Supports POSIX I/O model 


> Generally supports limited forms of parallelism 
e intra-node process parallelism 
e disk level parallelism possible via striping 
e not truly a parallel file system 

> Journal, extent based semantics 


e journaling (AKA logging): to log information about operations performed
on the file system meta-data as atomic transactions. In the event of a
system failure, a file system is restored to a consistent state by replaying
the log and applying log records for the appropriate transactions.

e extent: a sequence of contiguous blocks allocated to a file as a unit and
described by a triple consisting of <logical offset, length, physical>


> If they are a native FS, they are integrated into the OS (e.g., 
caching done via VMM) 


> Examples: ext3, JFS, NTFS, ReiserFS, XFS 
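
The extent triple described above can be illustrated with a small lookup routine. This is only a sketch of the idea, not how any of the listed file systems implement it; the `Extent` type, the block-based units and the linear search are assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Extent(NamedTuple):
    """One extent: <logical offset, length, physical>, all in blocks."""
    logical: int   # first logical block of the file covered by this extent
    length: int    # number of contiguous blocks
    physical: int  # first physical block on disk


def logical_to_physical(extents: Sequence[Extent], block: int) -> Optional[int]:
    """Map a logical file block to its physical block, or None if unmapped."""
    for extent in extents:
        if extent.logical <= block < extent.logical + extent.length:
            return extent.physical + (block - extent.logical)
    return None


# Hypothetical file made of two extents: 8 blocks, then 4 more elsewhere.
file_extents = [Extent(0, 8, 1000), Extent(8, 4, 5000)]

print(logical_to_physical(file_extents, 3))    # falls in the first extent
print(logical_to_physical(file_extents, 9))    # falls in the second extent
print(logical_to_physical(file_extents, 12))   # past the end of the file
```

Two triples are enough to describe all 12 blocks here, which is the appeal of extents over per-block maps for large, contiguous files.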
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Asynchronous I/O 


> Abstractions allowing multiple threads/tasks to safely and
simultaneously access a common file
e non-blocking I/O
e built on top of a base file system

> Parallelism available if it is supported in the base file system
> Included in the POSIX 4 standard
e not necessarily supported on all Unix operating systems

> Examples:
e commonly available under real time operating systems
e supported today on various "flavors" of standard Unix
-AIX, Solaris, Linux (starting with 2.6)
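
The idea of several tasks safely reading one common file without blocking the caller can be sketched as follows. This is an analogy built from a thread pool and positional reads (`os.pread`), not the POSIX asynchronous I/O interface itself; it assumes a Unix-like system, and `parallel_read` and the temporary demo file are hypothetical names used only for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def parallel_read(path: str, chunk_size: int, workers: int = 4) -> bytes:
    """Read a file as independent chunks issued from a pool of threads."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pread takes an explicit offset, so threads share the descriptor
            # without racing on a common file position.
            futures = [pool.submit(os.pread, fd, chunk_size, offset)
                       for offset in range(0, size, chunk_size)]
            # The caller is free to do other work here before collecting.
            return b"".join(future.result() for future in futures)
    finally:
        os.close(fd)


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 64)
    demo_path = tmp.name

demo_data = parallel_read(demo_path, chunk_size=4096)
os.remove(demo_path)
print(len(demo_data), "bytes read in chunks")
```

Whether the chunks are actually serviced in parallel depends on the base file system and storage underneath, as the slide notes.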
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Networked File Systems 


> Disk access from remote nodes via network access 
e generally based on TCP/IP over Ethernet 


e Useful for on-line interactive access (e.g., home directories) 


> NFS is ubiquitous in Unix/Linux environments 

e does not provide a genuinely parallel model of I/O
-it is not cache coherent (will future versions like DNFS correct this?)
-parallel write requires O_SYNC and -noac options to be safe

e poorer performance for HPC jobs, especially parallel I/O
-write: only 90 MB/s on system capable of 400 MB/s (4 tasks)
-read: only 381 MB/s on system capable of 740 MB/s (16 tasks)

e uses POSIX I/O API, but not its semantics

e traditional NFS configurations limited by "single server" bottleneck

e while NFS is not designed for parallel file access, by placing restrictions on an
application's file access and/or doing non-parallel I/O, it may be possible to
get "good enough" performance


e NFS clients available for Windows, but POSIX to NTFS mapping is awkward 
e GPFS provides a high availability version of NFS called Clustered NFS 


> CIFS is ubiquitous in Windows environments 


e Samba is a CIFS server available under Unix/Linux that maps a POSIX 
based file system to the Windows/NTFS model. 
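
The NFS note above says parallel writes need O_SYNC (together with the noac mount option on the client) to be safe. The application side of that can be sketched as below: each task opens the shared file with O_SYNC and writes a fixed-size record at its own disjoint offset. This is an illustration only; it runs against a local temporary file, the noac mount option is outside the code, and `write_record`, the record layout and the file name are hypothetical.

```python
import os
import tempfile


def write_record(path: str, rank: int, record: bytes) -> None:
    """Write one fixed-size record at the offset owned by this task."""
    # O_SYNC makes each write reach stable storage before the call returns.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        os.pwrite(fd, record, rank * len(record))
    finally:
        os.close(fd)


# Demonstration: four "tasks" (run one after another here) each write
# an 8-byte record into a shared file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "shared.dat")
for rank in range(4):
    write_record(demo_path, rank, bytes([ord("A") + rank]) * 8)

with open(demo_path, "rb") as handle:
    demo_contents = handle.read()

os.remove(demo_path)
os.rmdir(demo_dir)
print(demo_contents)
```

Synchronous writes trade throughput for safety, which is consistent with the reduced NFS write rates quoted above.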
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Networked File Systems 


[Figure: six file clients attached to a LAN.]

COMMENT:
Traditionally, a single NFS/CIFS file server manages both user data and metadata
operations which "gates" performance/scaling and presents a single point of failure risk.
Products (e.g., CNFS) are available that provide multiple server designs to avoid this issue.
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Network Attached Storage (AKA: Appliances) 


> Appliance Concept 


e Traditionally focused on the CIFS and/or NFS protocols

e Integrated HW/SW storage product
-integrates servers, storage controllers, disks, networks, file system, protocol, etc. all
into single product
-main advantage: "black box" design (i.e., ease of use at the expense of flexibility)
-not intended for high performance storage

e Provides an NFS server and/or CIFS/Samba solution
-these are server based products; they do not improve client access or operation
-may support other protocols (e.g., iSCSI, http)

e Generally based on Ethernet LANs

e Is this just a subclass of the networked file systems level?

> Examples

e Netapp
-Provides excellent performance for IOPS and transaction processing workloads with
favorable temporal locality.

e Scale-out File System (SoFS)
-Supports CIFS (Samba), http, iSCSI, NFS, NSD (i.e., GPFS) protocols
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Basic Clustered File Systems 


> Satisfies the definition of a clustered file system 
> File access is parallel 
e supports POSIX API, but provides safe parallel file access semantics 
-guarantees portability to other POSIX based file systems 
> File system overhead operations
e file system overhead operations are distributed and done in parallel
e there are no single server bottlenecks
-n.b., no metadata servers
> Common component architecture

e commonly configured using separate file clients and file servers
-this is common for reasons of economy; for many storage systems, it costs too
much to have a separate storage controller for every node

e some FS's allow a single component architecture where file clients and file
servers are combined (i.e., no distinction between client and server)
-yields very good scaling for asynchronous applications
> file clients access file data through file servers via the LAN 
> Example: GPFS, GFS, IBRIX Fusion 
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Basic Clustered File Systems 


[Figure: file servers.]

File system overhead operations are distributed across the entire
cluster and are done in parallel; they are not concentrated in any given place.
There is no single server bottleneck. User data and metadata flow
between all nodes and all disks via the file servers.
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> File access Is parallel 
e supports POSIX API, but provides safe parallel file access semantics
-guarantees portability to other POSIX based file systems

> File system overhead operations
e it is NOT done in parallel
e single metadata server with a backup metadata server
-metadata server is accessed via the LAN
-metadata server is a potential bottleneck, but it is not considered a limitation since these
FS's are generally used for smaller clusters

> Dual component architecture
e file client/server and metadata server

> All disks connected to all file client/server nodes via the SAN
e file data accessed via the SAN, not the LAN
-removes need for expensive LAN where high BW is required (e.g., IB, Myrinet)
e inhibits scaling due to cost of FC Switch Tree (i.e., SAN)

> Example: CXFS (SGI), SNFS (Quantum, formerly ADIC), QFS (Sun)
e ideal for smaller numbers of nodes
-SNFS scales to 50+ nodes
-CXFS scales up to 64+ nodes (appropriate for many-processor Altix systems)


SAN File Systems

[Figure: four file client/server nodes on a LAN.]

File system protocol is concentrated in the metadata server and is not
done in parallel; all file client/server nodes must coordinate file access
via the metadata server. There are generally no client only nodes in
this type of cluster, and hence large scaling is not needed.
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Multi-component Clustered File Systems 


> Satisfies the definition of a clustered file system 


> File access is parallel
e supports POSIX API, but provides safe parallel file access semantics
-guarantees portability to other POSIX based file systems
> File system overhead operations
e Lustre: 1 metadata server per file system (with backup) accessed via LAN
-potential bottleneck (deploy multiple file systems to avoid backup)
e Panasas: the "director blades" manage protocol
-each "shelf" contains a director blade and 10 disks accessible via Ethernet
-this provides multiple metadata servers reducing contention
> Multi-component architecture
e Lustre: file clients, file servers, metadata server
e Panasas: file clients, director blade
-director blade encapsulates file service, metadata service, storage controller operations

> file clients access file data through file servers or director blades via
the LAN

> Examples: Lustre, Panasas
e Lustre: Linux only, Panasas: Linux and Windows
e Object oriented disks
-Lustre emulates object oriented disks
-Panasas uses actual OO disks; user can only use Panasas disks

COMMENT: Do OO disks really add value to the FS? Other FS's efficiently
accomplish the same thing at a higher level.
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Multi-component Clustered File Systems 

















[Figure: Lustre and Panasas side by side. Lustre: file clients on a LAN, a concentrated metadata server, file servers, and a storage controller with disks. Panasas: file clients on a LAN and director blades, each combining metadata server, file server and storage controller in front of its disks.]

While different in many ways, Lustre and Panasas are similar in that they
both have concentrated file system overhead operations (i.e., protocol
management). The Panasas design, however, scales the number of
protocol managers proportionally to the number of disks and is less of a
bottleneck than for Lustre.
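
The scaling difference described above can be put in numbers with a simple counting model. This is only a sketch under the figures given on the earlier slide (one director blade per shelf of 10 disks for Panasas, one active metadata server per file system for Lustre); it counts protocol managers and says nothing about actual performance, and the function names are illustrative.

```python
import math

DISKS_PER_SHELF = 10  # from the slide: a director blade and 10 disks per shelf


def panasas_director_blades(total_disks: int) -> int:
    """Protocol managers grow with disk count: one per shelf."""
    return math.ceil(total_disks / DISKS_PER_SHELF)


def lustre_metadata_servers(file_systems: int) -> int:
    """One active metadata server per file system (backups not counted)."""
    return file_systems


for disks in (100, 600, 1200):
    blades = panasas_director_blades(disks)
    servers = lustre_metadata_servers(file_systems=1)
    print(f"{disks:5d} disks: Panasas {blades:3d} director blades "
          f"({disks // blades} disks each), "
          f"Lustre {servers} metadata server ({disks // servers} disks each)")
```

In this model the number of disks behind each Panasas protocol manager stays constant as the system grows, while a single Lustre file system concentrates all of them behind one metadata server.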
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Higher Level Parallel I/O 
> High level abstraction layer providing a parallel I/O model 
> Built on top of a base file system (conventional or parallel) 











> MPI-I/O is the ubiquitous model 
e parallel disk I/O extension to MPI in the MPI-2 standard 


e semantically richer API
-can do things that POSIX I/O was never designed to do

e applications using MPI-I/O are portable
> Requires significant source code modification for use in legacy
codes, but it has the advantage of being a standard (e.g.,
syntactic portability)
> Examples: MPICH, OpenMPI 
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Which File System is Best? 


There is no concise answer to this question. 
> It is application/customer specific. 
> All of them serve specific needs. 


> All of them work well if properly deployed and used according to 
their design specs. 


> Issues to consider are 
e application requirements 


-often requires compromise between competing needs 
e how the product implements a specific architecture 
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Risk Is Inevit 
= If feasible, create multiple file systems localized to a subset of the
disks to prevent collateral damage.
= As an added benefit, this will allow you to have different file systems
tuned for different access patterns.
= When using SATA disk, configure it using RAID 6
= Avoid single point of failure risk exposures
= Establish disaster recovery procedures
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Concluding Remarks 
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= PB sized file systems are not trivial 
Do not treat them as something peripheral to your environment 
= Take time to analyze and understand your storage requirements 


= Choose the proper storage tools (hardware and software) for your 
environment 

= Storage is not the entire picture; improving I/O performance will
uncover other bottlenecks.

= “A supercomputer is a device for turning compute-bound problems into
I/O-bound problems.”
= Ken Batcher, Professor of Computer Science, Kent State University


























